An XML-based approach to dialectological data: The development of syllabic liquids in Bulgarian
To what extent do the prosodic analyses of TrT groups in standard Bulgarian characterize the dialects of Bulgaria?

Sub-questions


- What is the role and nature of lexical diffusion in this process?
  - To clarify: by lexical diffusion we do not mean a non-Neogrammarian sound change.
  - Chronology:
    1. Sound change(s).
    2. Diffusion of tokens bearing various reflexes.


Why XML? 


- The Bulgarian Dialect Atlas (BDA) contains a lot of information pertaining to this... possibly too much (at first glance)!
  - Raw data lists are extremely difficult to process.
  - Maps are helpful, but impressionistic.
- XML (Extensible Markup Language) allows bottom-up rebuilding of the data set.
  - Instead of just word lists, data can be sorted and counted according to various criteria.
  - Maps can be regenerated to reflect various ways of sorting the data.
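As a rough illustration of "sorted and counted according to various criteria", the following Python sketch tallies the tokens of one site by attribute. The sample record is hypothetical and merely modelled on the atlas markup; the attribute name `lnum` (lexeme number) and the helper `count_by` are assumptions made for this example, not part of the project's actual tooling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample record in the atlas markup (values are illustrative).
SAMPLE = """
<site loc="NW">
  <site_number>655</site_number>
  <map>
    <token trt="pb" lnum="5">rpbn</token>
    <token trt="pb" lnum="9">Kpbk</token>
    <token trt="p" lnum="5">rpn</token>
    <token trt="bp" lnum="20">cbpn</token>
  </map>
</site>
"""


def count_by(site, attribute):
    """Count a site's tokens, grouped by the value of one token attribute."""
    return Counter(tok.get(attribute) for tok in site.iter("token"))


site = ET.fromstring(SAMPLE)
by_reflex = count_by(site, "trt")    # tokens per TrT reflex
by_lexeme = count_by(site, "lnum")   # tokens per lexeme number
print(by_reflex.most_common())
```

The same function serves any criterion that is encoded as an attribute, which is the practical advantage over a printed word list.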


Printed edition vs. XML 
    <token trt="pb" lnum="13">Kpbcdb</token>
    <token trt="pb" lnum="16">npbe</token>
    <token trt="pb" lnum="35">upbab</token>
    <token trt="p" lnum="5">rpn</token>
    <token trt="p" lnum="16">npc</token>
    <token trt="bp" lnum="20">cbpn</token>
  </map>
</site>


Atlas data in XML 


<site loc="NW">
  <site_number>655</site_number>
  <site_location>
    <longitude>23.349365</longitude>
    <latitude>43.387262</latitude>
  </site_location>
  <site_name>Стубел</site_name>
  <site_region>Михайловградско</site_region>
  <map>
    <token trt="pb" lnum="5">rpbn</token>
    <token trt="pb" lnum="9">Kpbk</token>
    <token trt="pb" lnum="13">Kpbdb</token>
    <token trt="pb" lnum="16">npbce</token>
    <token trt="pb" lnum="35">uyupbc</token>
    <token trt="p" lnum="5">rpn</token>
    <token trt="p" lnum="16">npc</token>
    <token trt="bp" lnum="20">cbpn</token>
  </map>
</site>

site = each site in the atlas
@loc = region (i.e., atlas volume)
site_number = standard site number used in the atlas
site_location = container for longitude and latitude
longitude = longitude of site
latitude = latitude of site
site_name = name of site
site_region = region of site
map = container for tokens
token = the word as printed in the atlas
@trt = the TrT value for the token
@lnum = a standard number created for the atlas to represent the lexeme
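To make the element legend concrete, here is a minimal Python sketch that reads one `site` record into typed objects. The class names `Site` and `Token`, the function `parse_site`, and the attribute spelling `lnum` are assumptions introduced for illustration; the sample record is shortened and its values are illustrative.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Shortened sample record in the atlas markup (values are illustrative).
SITE_XML = """
<site loc="NW">
  <site_number>655</site_number>
  <site_location>
    <longitude>23.349365</longitude>
    <latitude>43.387262</latitude>
  </site_location>
  <site_name>Стубел</site_name>
  <site_region>Михайловградско</site_region>
  <map>
    <token trt="pb" lnum="5">rpbn</token>
    <token trt="p" lnum="5">rpn</token>
    <token trt="bp" lnum="20">cbpn</token>
  </map>
</site>
"""


@dataclass
class Token:
    form: str  # the word as printed in the atlas
    trt: str   # TrT reflex of the token
    lnum: int  # lexeme number


@dataclass
class Site:
    number: int
    volume: str  # @loc, i.e. the atlas volume
    name: str
    region: str
    longitude: float
    latitude: float
    tokens: list = field(default_factory=list)


def parse_site(elem):
    """Convert one <site> element into a Site object."""
    tokens = [
        Token(form=t.text, trt=t.get("trt"), lnum=int(t.get("lnum")))
        for t in elem.findall("map/token")
    ]
    return Site(
        number=int(elem.findtext("site_number")),
        volume=elem.get("loc"),
        name=elem.findtext("site_name"),
        region=elem.findtext("site_region"),
        longitude=float(elem.findtext("site_location/longitude")),
        latitude=float(elem.findtext("site_location/latitude")),
        tokens=tokens,
    )


site = parse_site(ET.fromstring(SITE_XML))
print(site.number, site.name, len(site.tokens))
```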


Lexeme index in XML 


<lexeme>
  <word>rPn</word>
  <number>5</number>
  <token trt="ap" lnum="5">rapn</token>
  <token trt="bp" lnum="5">rbpn</token>
  <token trt="pb" lnum="5">rpbn</token>
  <token trt="ep" lnum="5">repn</token>
  <token trt="ap" lnum="5">rapn</token>
</lexeme>

<lexeme>
  <word>rPc</word>
  <number>6</number>
  <token trt="pb" lnum="6">rpbc</token>
  <token trt="op" lnum="6">ropc</token>
  <token trt="bp" lnum="6">rbpc'</token>
</lexeme>


lexeme = container for data relevant to each underlying "word"
word = (constructed) etymology, using P to stand in for the liquid
number = standard number to identify lexemes; identical to @lnum for each token
token = the word as printed in the atlas
@trt = the TrT value for the token
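One way such a lexeme index could be derived from the site data is to group tokens by their lexeme number. The sketch below is an assumption about how this might be done, not the authors' procedure: the function names are hypothetical, the attribute is assumed to be spelled `lnum`, and the constructed `<word>` etymology is left out because it cannot be computed from the tokens.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical two-site atlas in the markup described above.
ATLAS_XML = """
<atlas>
  <site><site_number>1</site_number>
    <map>
      <token trt="pb" lnum="5">rpbn</token>
      <token trt="bp" lnum="6">rbpc</token>
    </map>
  </site>
  <site><site_number>2</site_number>
    <map>
      <token trt="bp" lnum="5">rbpn</token>
      <token trt="pb" lnum="5">rpbn</token>
    </map>
  </site>
</atlas>
"""


def build_index(atlas):
    """Map each lexeme number to the set of (reflex, form) pairs attested for it."""
    index = defaultdict(set)
    for tok in atlas.iterfind("site/map/token"):
        index[int(tok.get("lnum"))].add((tok.get("trt"), tok.text))
    return index


def to_index_element(index):
    """Serialise the mapping as <index>/<lexeme> elements (without <word>)."""
    root = ET.Element("index")
    for lnum in sorted(index):
        lexeme = ET.SubElement(root, "lexeme")
        ET.SubElement(lexeme, "number").text = str(lnum)
        for trt, form in sorted(index[lnum]):
            tok = ET.SubElement(lexeme, "token", trt=trt, lnum=str(lnum))
            tok.text = form
    return root


index = build_index(ET.fromstring(ATLAS_XML))
index_elem = to_index_element(index)
print(ET.tostring(index_elem, encoding="unicode"))
```

Using a set per lexeme means a form attested at many sites appears only once in the index, which matches the idea of listing distinct reflexes per lexeme.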


Behind the scenes


<atlas>
  <site>
    <site_number>9</site_number>
    <site_location>
      <longitude>22.74344</longitude>
      <latitude>44.051005</latitude>
    </site_location>
    <site_name>Плакудер</site_name>
    <site_region>Видинско</site_region>
    <map mnum="107-4" data="trt1">
      <token trt="p" lnum="5">rpn</token>
      <token trt="p" lnum="10">Kpc</token>
      <token trt="p" lnum="13">Kpcp</token>
      <token trt="p" lnum="16">npc</token>
      <token trt="p" lnum="18">npy</token>
      <token trt="p" lnum="20">cpn</token>
      <token trt="p" lnum="34">upH</token>
    </map>
  </site>
  <index>
    <lexeme>
      <word>6Pc</word>
      <number>1</number>
      <token trt="pb" lnum="1">6pbc</token>
      <token trt="bp" lnum="1">6bpc</token>
    </lexeme>
  </index>
</atlas>
XSLT


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:import href="site_template.xsl"/>
  <xsl:key name="aword" match="site_name" use="../map/reflex/token"/>

  <xsl:template match="atlas">
    <div id="alphabetical">
      <h3>Alphabetical</h3>
      <ul>
        <xsl:for-each select="index/lexeme">
          <xsl:sort select="word" order="ascending"/>
          <li><a><!-- remainder of the list item is not shown --></a></li>
        </xsl:for-each>
      </ul>
    </div>
  </xsl:template>
</xsl:stylesheet>

[Screenshot: "Bulgarian Dialectological Atlas" web interface]
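For readers less familiar with XSLT, the alphabetical listing that the template produces can be approximated in Python. This is only a sketch under stated assumptions: the function `alphabetical_list` is hypothetical, the link inside each list item is omitted because its target is not shown on the slide, and `sorted` orders by Unicode code point, which may differ from the collation an XSLT processor applies.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical index fragment in the markup described above.
INDEX_XML = """
<atlas>
  <index>
    <lexeme><word>rPn</word><number>5</number></lexeme>
    <lexeme><word>6Pc</word><number>1</number></lexeme>
    <lexeme><word>rPc</word><number>6</number></lexeme>
  </index>
</atlas>
"""


def alphabetical_list(atlas):
    """Return an HTML fragment listing all lexemes in ascending order of <word>."""
    words = sorted(lex.findtext("word") for lex in atlas.iterfind("index/lexeme"))
    items = "".join(f"<li>{escape(w)}</li>" for w in words)
    return f'<div id="alphabetical"><h3>Alphabetical</h3><ul>{items}</ul></div>'


html_fragment = alphabetical_list(ET.fromstring(INDEX_XML))
print(html_fragment)
```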
Site view

- Where a lexeme displays multiple reflexes, those lexemes and the tokens are identified; both are clickable for more detail.
- A list of all tokens from the site is available; all tokens and reflexes are clickable for more detail.
- A map shows the location of the site.
Sites with at least one pb form

Sites that have pb also have...

bp
508 sites - 57.7% of pb sites have bp
1471 (78 unique) bp tokens co-occur with pb

p
252 sites - 28.6% of pb sites have p
1597 (217 unique) p tokens co-occur with pb


- A count of all the tokens with the reflex, all the sites with the reflex, what % of all sites have the reflex, and what % of sites only have the reflex
- Toggle-down lists of sites with the reflex for each region
- What reflexes co-occur with the reflex, and with what frequency
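The co-occurrence figures can be thought of as a simple set computation over sites. The following Python sketch shows one possible way to obtain "N sites, X% of sites with the reflex also have reflex Y"; the data are invented, the function names are hypothetical, and the attribute spelling `lnum` is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented five-site atlas; only the trt attribute matters for this count.
ATLAS_XML = """
<atlas>
  <site><site_number>1</site_number><map>
    <token trt="pb" lnum="5">rpbn</token><token trt="bp" lnum="20">cbpn</token>
  </map></site>
  <site><site_number>2</site_number><map>
    <token trt="pb" lnum="9">Kpbk</token><token trt="bp" lnum="20">cbpn</token>
  </map></site>
  <site><site_number>3</site_number><map>
    <token trt="pb" lnum="5">rpbn</token><token trt="bp" lnum="20">cbpn</token>
    <token trt="p" lnum="16">npc</token>
  </map></site>
  <site><site_number>4</site_number><map>
    <token trt="pb" lnum="5">rpbn</token>
  </map></site>
  <site><site_number>5</site_number><map>
    <token trt="p" lnum="5">rpn</token>
  </map></site>
</atlas>
"""


def reflexes_by_site(atlas):
    """Map each site number to the set of reflexes attested at that site."""
    return {
        site.findtext("site_number"): {t.get("trt") for t in site.iterfind("map/token")}
        for site in atlas.iterfind("site")
    }


def cooccurrence(site_reflexes, target):
    """For sites that have `target`, return {other: (site count, % of target sites)}."""
    with_target = [r for r in site_reflexes.values() if target in r]
    if not with_target:
        return {}
    counts = Counter(other for r in with_target for other in r if other != target)
    return {other: (n, 100.0 * n / len(with_target)) for other, n in counts.items()}


site_reflexes = reflexes_by_site(ET.fromstring(ATLAS_XML))
result = cooccurrence(site_reflexes, "pb")
for other, (n, pct) in sorted(result.items()):
    print(f"{n} sites - {pct:.1f}% of pb sites have {other}")
```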


Token view 
● = more than 50% of the site's tokens have the same reflex as the current token.
For those dialects that do not parallel the standard language, for how many does the distribution of TrT reflexes
Of those dialects that do not parallel the standard language, for how many does the distribution of TrT reflexes mostly follow a regular distribution with the intrusion of discordant lexemes?


Here defined as "sites where the reflex with the most tokens accounts for 75-99% of the tokens at that site".


249 (20%) 
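The definition above amounts to computing, for each site, the share of tokens that carry the most frequent reflex. A minimal Python sketch follows; it assumes the 75-99% band means "at least 75% but less than 100%", and the labels "uniform" and "mixed" for the other two bands are names invented here, not terms from the study.

```python
from collections import Counter


def dominant_share(reflexes):
    """Fraction of a site's tokens that carry the site's most frequent reflex."""
    if not reflexes:
        raise ValueError("site has no tokens")
    top_count = Counter(reflexes).most_common(1)[0][1]
    return top_count / len(reflexes)


def classify(share):
    """Bucket a dominant-reflex share; the 75-99% band is the one defined above."""
    if share == 1.0:
        return "uniform"          # hypothetical label: a single reflex throughout
    if share >= 0.75:
        return "mostly regular"   # regular pattern with discordant lexemes
    return "mixed"                # hypothetical label: no clearly dominant reflex


# Invented per-site lists of trt values.
sites = {
    "A": ["p"] * 9 + ["pb"],
    "B": ["p"] * 4,
    "C": ["p", "pb", "bp", "p"],
}
labels = {name: classify(dominant_share(trts)) for name, trts in sites.items()}
print(labels)
```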


Is lexical diffusion basically random, or 
do some words tend to diffuse more? 


- Many different possible metrics to get at this.
- Lexemes are attested with 1-16 discrete reflexes; what conditions this?
  - Chance: # of attested reflexes is strongly correlated with # of attested locations; r = .8568, p < .0001.
- How often is a given lexeme the bearer of a unique TrT reflex at some geographic point?
  - # of unique TrT reflexes varies from 0 to 32.
  - # of unique TrT reflexes is strongly correlated with # of attested locations; r = .8949, p < .0001.
- Lexical diffusion seems to be basically random.
  - This agrees with impressionistic assessments...
  - ...but would be difficult to prove based on the atlas alone.
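The correlations reported above relate two per-lexeme counts. As a sketch of how such a coefficient could be computed, the Python below implements Pearson's r from its definition; the numbers are invented for illustration and are not the atlas figures, and the p-value is not computed here (that would need a t-distribution, e.g. from a statistics package).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Invented per-lexeme counts: attested locations and distinct reflexes.
locations = [10, 40, 120, 300, 650]
reflexes = [1, 3, 5, 9, 16]
r = pearson_r(locations, reflexes)
print(f"r = {r:.4f}")
```

A high r here only says that widely attested lexemes show more reflexes, which is what chance alone would predict; it does not by itself identify lexemes that diffuse more readily.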


Conclusions 


- XML markup of a pre-existing data set allows a much more nuanced application than would otherwise be possible.
  - This enables answering linguistic questions that would otherwise be near-intractable.
  - Suggests ways to maximize the utility of scholarly heritage.
- Problems / Future Steps:
  - Incomplete / inconsistent data across volumes.
    - e.g., "generally X, but here's some Y" for polysyllables.
  - What quantitative metrics to apply to the data?
    - Incorporation of geographic data
    - Similarity metrics to compare geographic points, the geographic distribution of reflexes, etc.
  - Research questions similar, but orthogonal, to the Buldialect project (Osenova et al. 2007, Heeringa et al. 2010).

Sources for XML and XSLT information: on handout.


